Rey et al. 2020 data analysis
Resources
- https://deneflab.github.io/MicrobeMiseq/demos/mothur_2_phyloseq.html
- https://joey711.github.io/phyloseq/plot_ordination-examples.html
- https://david-barnett.github.io/microViz/articles/web-only/ordination.html
- https://www.gdc-docs.ethz.ch/MDA/handouts/MDA20_PhyloseqFormation_Mahendra_Mariadassou.pdf
- http://rstudio-pubs-static.s3.amazonaws.com/266780_cac4994322494658904507a7606b1dd8.html
Running this notebook
To run run this notebook in its entirety, clone or download the iobis/pacman-pipeline-training
repository. The notebook can be found in the
datasets/rey/analysis directory.
Importing data
The PacMAN pipeline exports results as a phyloseq object
which can be imported into R for analysis. This is how
phyloseq objects are structured:
Read the phyloseq object and verify that we have a OTU
table, a sample table, and a taxonomy table. A phylogenetic tree has not
been calculated so that slot is empty.
physeq <- readRDS("../results_noblast/phyloseq_object.rds")
physeq## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 5327 taxa and 22 samples ]
## sample_data() Sample Data: [ 22 samples by 20 sample variables ]
## tax_table() Taxonomy Table: [ 5327 taxa by 14 taxonomic ranks ]
While we can access these tables individually (for example using
physeq@sam_data), the phyloseq package also has a function
psmelt() to convert the phyloseq object into a
single large data frame:
library(dplyr)
df <- psmelt(physeq) %>%
# convert empty strings to NA across all character columns
mutate_if(is.character, ~na_if(., "")) %>%
# convert to tibble for prettier printing
as_tibble()
df## # A tibble: 117,194 × 37
## OTU Sample Abund…¹ eventID mater…² event…³ verba…⁴ local…⁵ verba…⁶ event…⁷
## <chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 asv.2 SAMN1… 25147 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m 2017-1…
## 2 asv.7 SAMN1… 16275 SAMN10… SAMN10… settle… 43.28 … Spain:… 7m 2017-0…
## 3 asv.8 SAMN1… 15862 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-0…
## 4 asv.6 SAMN1… 15401 SAMN10… SAMN10… filter… 43.34 … Spain:… surface 2017-0…
## 5 asv.5 SAMN1… 12944 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-1…
## 6 asv.11 SAMN1… 12292 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m 2017-0…
## 7 asv.14 SAMN1… 11142 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
## 8 asv.10 SAMN1… 10447 SAMN10… SAMN10… filter… 43.37 … Spain:… surface 2017-0…
## 9 asv.3 SAMN1… 10377 SAMN10… SAMN10… settle… 43.34 … Spain:… 7m 2017-1…
## 10 asv.18 SAMN1… 9850 SAMN10… SAMN10… settle… 43.37 … Spain:… 7m 2017-0…
## # … with 117,184 more rows, 27 more variables: occurrenceStatus <chr>,
## # target_gene <chr>, subfragment <chr>, pcr_primer_forward <chr>,
## # pcr_primer_reverse <chr>, pcr_primer_name_forw <chr>,
## # pcr_primer_name_reverse <chr>, pcr_primer_reference <chr>,
## # lib_layout <chr>, seq_meth <chr>, sop <chr>, votu_db <chr>, My_name <chr>,
## # kingdom <chr>, phylum <chr>, class <chr>, order <chr>, family <chr>,
## # genus <chr>, species <chr>, lastvalue <chr>, otu_seq_comp_appr <chr>, …
Exploratory data analysis
Abundance by phylum
Let’s first determine which are the most abundant phyla across all
samples. We will use this information to bin the less abundant phyla
into a category Other for simplifying some of our
graphics.
stats_phyla <- df %>%
group_by(phylum) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance))
top_phyla <- head(na.omit(stats_phyla$phylum), 8)
top_phyla## [1] "Arthropoda" "Cnidaria" "Bacillariophyta" "Annelida"
## [5] "Bryozoa" "Nemertea" "Chlorophyta" "Mollusca"
Now create a color scale:
phylum_colors <- RColorBrewer::brewer.pal(9, "Spectral")
names(phylum_colors) <- c(top_phyla, "Other")
phylum_colors## Arthropoda Cnidaria Bacillariophyta Annelida Bryozoa
## "#D53E4F" "#F46D43" "#FDAE61" "#FEE08B" "#FFFFBF"
## Nemertea Chlorophyta Mollusca Other
## "#E6F598" "#ABDDA4" "#66C2A5" "#3288BD"
And add a new column with binned phyla:
df <- df %>%
mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other")) %>%
mutate(phylum_binned = factor(phylum_binned, levels = c(top_phyla, "Other")))Let’s also add the binned phyla to the phyloseq object
for later use:
# remotes::install_github("david-barnett/microViz")
physeq <- physeq %>%
microViz::tax_mutate(phylum = ifelse(phylum == "", NA, phylum)) %>%
microViz::tax_mutate(phylum_binned = ifelse(is.na(phylum) | phylum %in% top_phyla, phylum, "Other"))Abundance by sample type and phylum
First list the abundance by phylum and sample type:
library(tidyr)
stats_type_phylum <- df %>%
group_by(eventRemarks, phylum) %>%
summarize(abundance = sum(Abundance))
spread(stats_type_phylum, eventRemarks, abundance) %>%
mutate(total = `settlement plates` + `filtered water`) %>%
arrange(desc(total)) %>%
knitr::kable()| phylum | filtered water | settlement plates | total |
|---|---|---|---|
| NA | 353725 | 126557 | 480282 |
| Arthropoda | 13857 | 99041 | 112898 |
| Cnidaria | 3603 | 60871 | 64474 |
| Bacillariophyta | 45861 | 979 | 46840 |
| Annelida | 8706 | 15085 | 23791 |
| Bryozoa | 454 | 22631 | 23085 |
| Nemertea | 7 | 20234 | 20241 |
| Chlorophyta | 17922 | 10 | 17932 |
| Mollusca | 3801 | 6219 | 10020 |
| Haptista | 4682 | 13 | 4695 |
| Rhodophyta | 3082 | 197 | 3279 |
| Rotifera | 1926 | 56 | 1982 |
| Oomycota | 1303 | 414 | 1717 |
| Basidiomycota | 1481 | 18 | 1499 |
| Chordata | 1213 | 211 | 1424 |
| Platyhelminthes | 11 | 1382 | 1393 |
| Prasinodermophyta | 1071 | 0 | 1071 |
| Echinodermata | 767 | 53 | 820 |
| Ascomycota | 733 | 3 | 736 |
| Streptophyta | 444 | 0 | 444 |
| Porifera | 281 | 115 | 396 |
| Discosea | 173 | 63 | 236 |
| Nematoda | 21 | 95 | 116 |
| Placozoa | 96 | 0 | 96 |
| Phoronida | 57 | 18 | 75 |
| Evosea | 62 | 7 | 69 |
| Gastrotricha | 6 | 32 | 38 |
| Picozoa | 28 | 0 | 28 |
| Chytridiomycota | 18 | 0 | 18 |
| Sipuncula | 12 | 0 | 12 |
| Imbricatea | 0 | 7 | 7 |
| Chaetognatha | 5 | 0 | 5 |
| Entoprocta | 0 | 4 | 4 |
| Tubulinea | 3 | 0 | 3 |
| Onychophora | 2 | 0 | 2 |
| Zoopagomycota | 2 | 0 | 2 |
Now create a graph using the binned phyla:
library(ggplot2)
stats_type_phylum <- df %>%
# calculate total abundance by sample type
group_by(eventRemarks) %>%
mutate(abundance = Abundance / sum(Abundance)) %>%
group_by(eventRemarks, phylum_binned) %>%
summarize(abundance = sum(abundance))
ggplot() +
geom_bar(data = stats_type_phylum, aes(y = eventRemarks, fill = phylum_binned, x = abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal()Abundance (reads) by sample and phylum
stats_sample_phylum <- df %>%
# calculate total abundance by sample type
group_by(Sample) %>%
mutate(relative_abundance = Abundance / sum(Abundance)) %>%
group_by(Sample, eventRemarks, phylum_binned) %>%
summarize(relative_abundance = sum(relative_abundance), abundance = sum(Abundance))Absolute reads abundance:
ggplot() +
geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())Relative reads abundance:
ggplot() +
geom_bar(data = stats_sample_phylum, aes(y = Sample, fill = phylum_binned, x = relative_abundance), stat = "identity") +
scale_fill_manual(values = phylum_colors, na.value = "#eeeeee") +
theme_minimal() +
theme(panel.grid.major.y = element_blank(), panel.grid.minor.y = element_blank())Most abundant species
Let’s list the species with the highest relative abundance across all samples:
top_species <- df %>%
filter(!is.na(species)) %>%
group_by(species) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance)) %>%
head(20)
top_species %>% knitr::kable()| species | abundance |
|---|---|
| Balanus trigonus | 37896 |
| Bougainvillia muscus | 33490 |
| Minutocellus polymorphus | 27149 |
| Obelia dichotoma | 19006 |
| Emplectonema gracile | 16805 |
| Micromonas pusilla | 15598 |
| Dictyocha speculum | 13538 |
| Bugula neritina | 10243 |
| Harpyia umbrosa | 9099 |
| Cutleria multifida | 5503 |
| Ostrea stentina | 5267 |
| Polydora neocaeca | 5066 |
| Paracalanus parvus | 4374 |
| Sabellaria spinulosa | 3627 |
| Amphibalanus eburneus | 3567 |
| Palmaria decipiens | 2398 |
| Amphinema dinema | 2084 |
| Chloroparvula pacifica | 1528 |
| Pseudochattonella farcimen | 1509 |
| Campanularia hincksii | 1478 |
For the most abundant species, plot the abundance by sample:
stats <- df %>%
filter(species %in% top_species$species) %>%
group_by(species, eventRemarks, Sample) %>%
summarize(abundance = sum(Abundance))
ggplot() +
geom_jitter(data = stats, aes(y = species, x = abundance, color = eventRemarks), pch = 21, height = 0.1, width = 0) +
scale_color_brewer(palette = "Set1") +
theme_minimal()Invasive species
In this section we will check our results against the World Register of Introduced Marine Species (WRiMS). The repository contains a CSV file containing the LSIDs for all species in WRiMS.
wrims_lsids <- read.csv("wrims_lsids.csv")
invasives <- df %>%
filter(lsid %in% wrims_lsids$lsid)
invasives %>%
group_by(phylum, species) %>%
summarize(abundance = sum(Abundance)) %>%
arrange(desc(abundance)) %>%
rmarkdown::paged_table()ASV accumulation curves
To do.
Alpha and beta diversity
To do.
Multidimensional scaling with phyloseq
Phyloseq’s ordinate() function allows us to do
MultiDimensional Scaling (MDS) or Principal Coordinate Analysis (PCoA)
on our dataset. The default for this function is to use Bray-Curtis
Dissimilarity to calculate a distance matrix.
ord <- ordinate(physeq = physeq, method = "PCoA", distance = "bray")
plot_ordination(physeq = physeq, ordination = ord, type = "samples", color = "eventRemarks", shape = "locality") +
geom_point(aes(color = eventRemarks), alpha = 1, size = 4) +
geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
stat_ellipse(aes(group = eventRemarks)) +
scale_color_brewer(palette = "Set1") +
scale_shape(solid = TRUE) +
theme_classic()We can also add taxa to the ordination plot, but styling this is a bit awkward as samples and taxa are treated as a single dataset.
plot_ordination(physeq = subset_taxa(physeq, !is.na(phylum)), ordination = ord, type = "biplot", color = "phylum_binned") +
scale_color_manual(values = phylum_colors, na.value = "#000000") +
geom_text(aes(label = materialSampleID), alpha = 0.7, size = 3, vjust = 2) +
theme_classic()Non-metric MDS (NMDS) with phyloseq
Another ordination method is non-metric MDS (NMDS). Let’s try NMDS with our dataset aggregated at the species level, and plot the most abundant species together with the sites.
First aggregates the phyloseq object to the species
level:
physeq_species <- tax_glom(physeq, taxrank = "species", NArm = TRUE, bad_empty = c(NA, ""))Then perform the ordination with method NMDS:
ord_species <- ordinate(physeq_species, method = "NMDS", distance = "bray")Extract the NMDS values for the sites and species, and join with the
relevant tables from the phyloseq object to get the
necessary metadata for plotting (such as species name, locality, and
sample type):
sites <- ord_species$points %>%
as.data.frame() %>%
tibble::rownames_to_column("site") %>%
inner_join(physeq_species@sam_data %>% as.data.frame() %>% tibble::rownames_to_column("site"), by = "site")
species <- ord_species$species %>%
as.data.frame() %>%
tibble::rownames_to_column("asv") %>%
left_join(physeq_species@tax_table %>% as.data.frame() %>% tibble::rownames_to_column("asv"), by = "asv") %>%
filter(species %in% top_species$species)ggplot() +
geom_segment(data = species, aes(x = 0, y = 0, xend = MDS1, yend = MDS2), arrow = arrow(length = unit(2, "mm")), color = "#aaaaaa") +
geom_point(data = sites, aes(MDS1, MDS2, color = eventRemarks, shape = locality), size = 3, stroke = 1.5) +
geom_text(data = species, aes(MDS1, MDS2, label = species), size = 3, vjust = "outward") +
scale_color_brewer(palette = "Set1") +
scale_shape(solid = FALSE) +
theme_classic()